
RFC: Use non-blocking device side pointer mode in CUBLAS, with fallbacks #2616

Open · wants to merge 2 commits into master from ksh/device_side
Conversation

@kshyatt (Contributor) commented Jan 10, 2025

Attempting to address #2571

I've set the pointer mode to "device side" during handle creation. Since gemmGroupedBatched doesn't support device-side pointer mode, it won't be usable with this change. One workaround would be to add a new function that creates a handle in host-side mode, or to add the pointer mode as an optional kwarg to handle(). Very open to feedback on this.
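
For context, a rough sketch of what toggling the mode on an existing handle could look like via the low-level bindings (this assumes the auto-generated cublasSetPointerMode_v2 wrapper and enum values; the PR itself sets the mode once at handle creation instead):

using CUDA, CUDA.CUBLAS

h = CUBLAS.handle()
# switch the task-local handle to device-side pointer mode
CUBLAS.cublasSetPointerMode_v2(h, CUBLAS.CUBLAS_POINTER_MODE_DEVICE)
# ... calls whose alpha/beta/result pointers live in device memory ...
# temporarily fall back to host-side mode, e.g. for gemmGroupedBatched
CUBLAS.cublasSetPointerMode_v2(h, CUBLAS.CUBLAS_POINTER_MODE_HOST)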

I've set this up so that users can supply a CuRef of the appropriate result type to the level-1 functions to receive their results. If one isn't provided, the functions execute as they do today (synchronously). Similarly, for functions taking alpha or beta scalar arguments: if the user provides a CuRef (actually a CuRefArray), the function executes asynchronously and returns immediately; if the user provides a Number, the behaviour is unchanged from today. I'm not married to this design and it can certainly be changed.
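
A minimal sketch of the two call styles this enables (signatures as proposed in this PR, e.g. for nrm2 and axpy!; not yet merged API):

using CUDA, CUDA.CUBLAS

x = CUDA.rand(Float64, 10^6)
y = CUDA.rand(Float64, 10^6)

# asynchronous: device-side scalar/result arguments, returns immediately
res = CuRef{Float64}(0.0)
CUBLAS.nrm2(length(x), x, res)                      # result lands in res on the device
CUBLAS.axpy!(length(x), CuRef{Float64}(2.0), x, y)  # alpha lives on the device

# synchronous fallbacks: plain Numbers / no result ref, behave as today
nrm = CUBLAS.nrm2(length(x), x)                     # blocks and returns a Float64
CUBLAS.axpy!(length(x), 2.0, x, y)                  # wraps alpha and synchronizes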

cc @Jutho

@kshyatt requested a review from @maleadt on January 10, 2025 21:03
@kshyatt added the "cuda libraries" label (Stuff about CUDA library wrappers.) on January 10, 2025
@kshyatt (Contributor, Author) commented Jan 10, 2025

I can also add some more @eval blocks to cut down on the repetitive fallback logic.
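
A sketch of that pattern (hypothetical foo!, but the same shape as the real loops in wrappers.jl):

for elty in (:Float64, :Float32)
    @eval begin
        # host-scalar fallback: wrap alpha in a device-side ref and dispatch
        # to the CuRefArray method, then synchronize to preserve today's
        # blocking behaviour
        function foo!(n::Integer, alpha::Number, x::StridedCuVecOrDenseMat{$elty})
            gpu_α = CuRef{$elty}(alpha)
            foo!(n, gpu_α, x)
            synchronize()
            return x
        end
    end
end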

@kshyatt (Contributor, Author) commented Jan 10, 2025

Sample speedup:

julia> using CUDA, CUDA.CUBLAS, LinearAlgebra;

julia> n = Int(2^26);

julia> X = CUDA.rand(Float64, n);

julia> res = CuRef{Float64}(0.0);

# do some precompilation runs first

julia> @time CUBLAS.nrm2(n, X, res);
  0.000104 seconds (18 allocations: 288 bytes)

julia> @time CUBLAS.nrm2(n, X);
  0.001564 seconds (73 allocations: 3.094 KiB)

@github-actions (bot) left a comment

CUDA.jl Benchmarks

Benchmark suite Current: cc7e01f Previous: 4bec614 Ratio
latency/precompile 45488869455 ns 45396234276 ns 1.00
latency/ttfp 6280704627 ns 6416277525.5 ns 0.98
latency/import 2962175349 ns 3047951471 ns 0.97
integration/volumerhs 9569307 ns 9572210 ns 1.00
integration/byval/slices=1 146908 ns 146689 ns 1.00
integration/byval/slices=3 425424 ns 424769 ns 1.00
integration/byval/reference 145136 ns 144911 ns 1.00
integration/byval/slices=2 285958 ns 285674 ns 1.00
integration/cudadevrt 103374 ns 103228 ns 1.00
kernel/indexing 14165.5 ns 13962 ns 1.01
kernel/indexing_checked 14847.5 ns 14556 ns 1.02
kernel/occupancy 634.1656804733727 ns 693.384105960265 ns 0.91
kernel/launch 2063.9 ns 2164.166666666667 ns 0.95
kernel/rand 14540.5 ns 14418 ns 1.01
array/reverse/1d 20127 ns 19581 ns 1.03
array/reverse/2d 24975 ns 24389 ns 1.02
array/reverse/1d_inplace 11254 ns 10606.666666666666 ns 1.06
array/reverse/2d_inplace 13089 ns 11144 ns 1.17
array/copy 20876 ns 20336 ns 1.03
array/iteration/findall/int 156012 ns 156856.5 ns 0.99
array/iteration/findall/bool 134582 ns 135569 ns 0.99
array/iteration/findfirst/int 153291 ns 153474.5 ns 1.00
array/iteration/findfirst/bool 152343 ns 152950 ns 1.00
array/iteration/scalar 59300 ns 60882 ns 0.97
array/iteration/logical 202346.5 ns 202672 ns 1.00
array/iteration/findmin/1d 37654 ns 37856 ns 0.99
array/iteration/findmin/2d 93622 ns 93737 ns 1.00
array/reductions/reduce/1d 36564.5 ns 38166 ns 0.96
array/reductions/reduce/2d 51306 ns 51122 ns 1.00
array/reductions/mapreduce/1d 33306 ns 31151.5 ns 1.07
array/reductions/mapreduce/2d 46557.5 ns 49629.5 ns 0.94
array/broadcast 20821 ns 21225 ns 0.98
array/copyto!/gpu_to_gpu 13362 ns 13324 ns 1.00
array/copyto!/cpu_to_gpu 208863 ns 208348.5 ns 1.00
array/copyto!/gpu_to_cpu 241484 ns 241560 ns 1.00
array/accumulate/1d 108701 ns 108467 ns 1.00
array/accumulate/2d 80028 ns 79962 ns 1.00
array/construct 1279.6 ns 1342.7 ns 0.95
array/random/randn/Float32 43235 ns 43560.5 ns 0.99
array/random/randn!/Float32 26476 ns 26195 ns 1.01
array/random/rand!/Int64 27068 ns 27079 ns 1.00
array/random/rand!/Float32 8805 ns 8700 ns 1.01
array/random/rand/Int64 30244 ns 29827 ns 1.01
array/random/rand/Float32 13382 ns 12930 ns 1.03
array/permutedims/4d 61027 ns 67316 ns 0.91
array/permutedims/2d 55247 ns 56600 ns 0.98
array/permutedims/3d 56104 ns 59248 ns 0.95
array/sorting/1d 2775155 ns 2764861 ns 1.00
array/sorting/by 3365800 ns 3352588 ns 1.00
array/sorting/2d 1083397 ns 1080760 ns 1.00
cuda/synchronization/stream/auto 1022.4 ns 1111.7 ns 0.92
cuda/synchronization/stream/nonblocking 6309.4 ns 6387.8 ns 0.99
cuda/synchronization/stream/blocking 800.8117647058823 ns 831.395061728395 ns 0.96
cuda/synchronization/context/auto 1171.5 ns 1212.1 ns 0.97
cuda/synchronization/context/nonblocking 6492.2 ns 6586.8 ns 0.99
cuda/synchronization/context/blocking 912.7941176470588 ns 916.775 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

(Review comment on lib/cublas/wrappers.jl: outdated, resolved.)
@kshyatt (Contributor, Author) commented Jan 11, 2025 via email

@kshyatt (Contributor, Author) commented Jan 11, 2025

Is the test failure something I've done? It seems GPUArrays-related.

@kshyatt force-pushed the ksh/device_side branch 2 times, most recently from a0829fa to 5d52d10 on January 16, 2025 16:05
@kshyatt (Contributor, Author) commented Jan 16, 2025

OK, I think this is ready for review!

@Jutho (Contributor) commented Jan 16, 2025

I am not qualified to review, but certainly interested in the outcome. Will the non-blocking methods only accept CuRef objects for the scalar input or output quantities, or also zero-dimensional arrays (i.e. CuArray{T,0})?

@kshyatt (Contributor, Author) commented Jan 16, 2025 via email

@kshyatt (Contributor, Author) commented Jan 16, 2025

You can create a CuRefArray{T}, where T is some element type, from a single-element CuVector. In fact, CuRef itself does this under the hood.
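
For example (the direct CuRefArray constructor here is an assumption based on how CuRef prints its result; the exact signature may differ):

using CUDA

v = CUDA.ones(Float64, 1)   # a single-element CuVector
r = CUDA.CuRefArray(v)      # wrap it as a device-side Ref (assumed constructor)
r2 = CuRef{Float64}(1.0)    # equivalent: CuRef allocates the buffer for you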

@maleadt (Member) left a comment

I wonder if we should also improve CuRef to initialize its memory by calling fill instead of memcpy: when calling memcpy, the copy likely won't be truly asynchronous (that would require pinned memory). But if we call fill, which should be possible for most scalars, the argument is passed by value, and I think the call will complete asynchronously.
Something to investigate!
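
A sketch of the two initialization strategies being contrasted, in terms of plain array operations (not the actual CuRef internals):

using CUDA

# memcpy-style: the scalar is staged through pageable host memory,
# so the copy is unlikely to be truly asynchronous
buf = CuArray{Float32}(undef, 1)
copyto!(buf, [1.0f0])

# fill-style: the value travels as a kernel argument (passed by value),
# so the write can complete asynchronously on the stream
buf2 = CuArray{Float32}(undef, 1)
fill!(buf2, 1.0f0)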

Comment on lines 130 to 131
α = convert(T, alpha)
gpu_α = CuRef{T}(α)
@maleadt (Member):

The converts can go; CuRef does that for you:

julia> CuRef{Float32}(1)
CUDA.CuRefArray{Float32, CuArray{Float32, 1, CUDA.DeviceMemory}}(Float32[1.0], 1)

α = convert(T, alpha)
gpu_α = CuRef{T}(α)
scal!(n, gpu_α, x)
synchronize()
@maleadt (Member):

Why the synchronization? The only way to observe the changes made by this call is to fetch the memory, which is a synchronizing operation.

@kshyatt (Contributor, Author):

For scal! and other functions that don't return a scalar result, I added this to keep the previous behaviour (so that the entire call is synchronous). I'll remove the sync for things like nrm2! that return a scalar, which is copied back anyway.
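
In other words, a sketch of the intended semantics (using this PR's signatures):

x = CUDA.rand(Float64, 1024)

CUBLAS.scal!(length(x), CuRef{Float64}(2.0), x)  # device-side alpha: async, returns immediately
CUBLAS.scal!(length(x), 2.0, x)                  # host Number: wraps alpha and synchronizes, as before

res = CuRef{Float64}(0.0)
CUBLAS.nrm2(length(x), x, res)                   # async; no explicit sync needed, since...
res[]                                            # ...fetching the scalar back synchronizes anyway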

(Review comment on lib/cublas/wrappers.jl: outdated, resolved.)
@maleadt (Member) commented Jan 17, 2025

Something to investigate!

#2625

(A github-actions[bot] comment was marked as off-topic.)

@github-actions (bot) commented Jan 20, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Suggested changes:
diff --git a/lib/cublas/wrappers.jl b/lib/cublas/wrappers.jl
index 24f414af0..ea213ff66 100644
--- a/lib/cublas/wrappers.jl
+++ b/lib/cublas/wrappers.jl
@@ -115,8 +115,9 @@ for (fname, fname_64, elty) in ((:cublasDscal_v2, :cublasDscal_v2_64, :Float64),
                                 (:cublasCscal_v2, :cublasCscal_v2_64, :ComplexF32))
     @eval begin
         function scal!(n::Integer,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
-                       x::StridedCuVecOrDenseMat{$elty}) where {M <: AbstractMemory}
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                x::StridedCuVecOrDenseMat{$elty}
+            ) where {M <: AbstractMemory}
             if CUBLAS.version() >= v"12.0"
                 $fname_64(handle(), n, alpha, x, stride(x, 1))
             else
@@ -147,8 +148,9 @@ for (fname, fname_64, elty, celty) in ((:cublasCsscal_v2, :cublasCsscal_v2_64, :
                                        (:cublasZdscal_v2, :cublasZdscal_v2_64, :Float64, :ComplexF64))
     @eval begin
         function scal!(n::Integer,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
-                       x::StridedCuVecOrDenseMat{$celty}) where {M<:AbstractMemory}
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                x::StridedCuVecOrDenseMat{$celty}
+            ) where {M <: AbstractMemory}
             if CUBLAS.version() >= v"12.0"
                 $fname_64(handle(), n, alpha, x, stride(x, 1))
             else
@@ -190,9 +192,9 @@ for (jname, fname, fname_64, elty) in ((:dot, :cublasDdot_v2, :cublasDdot_v2_64,
     @eval begin
         function $jname(n::Integer,
                         x::StridedCuVecOrDenseMat{$elty},
-                        y::StridedCuVecOrDenseMat{$elty},
-                        result::CuRefArray{$elty, CuVector{$elty, M}},
-            ) where {M<:AbstractMemory}
+                y::StridedCuVecOrDenseMat{$elty},
+                result::CuRefArray{$elty, CuVector{$elty, M}},
+            ) where {M <: AbstractMemory}
             if CUBLAS.version() >= v"12.0"
                 $fname_64(handle(), n, x, stride(x, 1), y, stride(y, 1), result)
             else
@@ -236,7 +238,7 @@ function dotu(
     return result[]
 end
 
-function dot(n::Integer, x::StridedCuVecOrDenseMat{Float16}, y::StridedCuVecOrDenseMat{Float16}, result::CuRefArray{Float16, CuVector{Float16, M}}) where {M<:AbstractMemory}
+function dot(n::Integer, x::StridedCuVecOrDenseMat{Float16}, y::StridedCuVecOrDenseMat{Float16}, result::CuRefArray{Float16, CuVector{Float16, M}}) where {M <: AbstractMemory}
     cublasDotEx(handle(), n, x, Float16, stride(x, 1), y, Float16, stride(y, 1), result, Float16, Float32)
     return result
 end
@@ -263,7 +265,7 @@ for (fname, fname_64, elty, ret_type) in ((:cublasDnrm2_v2, :cublasDnrm2_v2_64,
         function nrm2(n::Integer,
                 X::StridedCuVecOrDenseMat{$elty},
                 result::CuRefArray{$ret_type, CuVector{$ret_type, M}},
-            ) where {M<:AbstractMemory}
+            ) where {M <: AbstractMemory}
             if CUBLAS.version() >= v"12.0"
                 $fname_64(handle(), n, X, stride(X, 1), result)
             else
@@ -339,9 +341,10 @@ for (fname, fname_64, elty) in ((:cublasDaxpy_v2, :cublasDaxpy_v2_64, :Float64),
                                 (:cublasCaxpy_v2, :cublasCaxpy_v2_64, :ComplexF32))
     @eval begin
         function axpy!(n::Integer,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                        dx::StridedCuVecOrDenseMat{$elty},
-                       dy::StridedCuVecOrDenseMat{$elty}) where {M <: AbstractMemory}
+                dy::StridedCuVecOrDenseMat{$elty}
+            ) where {M <: AbstractMemory}
             if CUBLAS.version() >= v"12.0"
                 $fname_64(handle(), n, alpha, dx, stride(dx, 1), dy, stride(dy, 1))
             else
@@ -399,9 +402,9 @@ for (fname, fname_64, elty, cty, sty) in (
         function rot!(n::Integer,
                       x::StridedCuVecOrDenseMat{$elty},
                       y::StridedCuVecOrDenseMat{$elty},
-                      c::CuRefArray{$cty, CuVector{$cty, M}},
-                      s::CuRefArray{$sty, CuVector{$sty, M}},
-                     ) where {M <: AbstractMemory}
+                c::CuRefArray{$cty, CuVector{$cty, M}},
+                s::CuRefArray{$sty, CuVector{$sty, M}},
+            ) where {M <: AbstractMemory}
             if CUBLAS.version() >= v"12.0"
                 $fname_64(handle(), n, x, stride(x, 1), y, stride(y, 1), c, s)
             else
@@ -472,9 +475,9 @@ for (fname, fname_64, elty) in ((:cublasIdamax_v2, :cublasIdamax_v2_64, :Float64
                                 (:cublasIcamax_v2, :cublasIcamax_v2_64, :ComplexF32))
     @eval begin
         function iamax(n::Integer,
-                       dx::StridedCuVecOrDenseMat{$elty},
-                       result::CuRefArray{Ti, CuVector{Ti, M}},
-                      ) where {Ti <: Integer, M <: AbstractMemory}
+                dx::StridedCuVecOrDenseMat{$elty},
+                result::CuRefArray{Ti, CuVector{Ti, M}},
+            ) where {Ti <: Integer, M <: AbstractMemory}
             if CUBLAS.version() >= v"12.0"
                 $fname_64(handle(), n, dx, stride(dx, 1), result)
             else
@@ -493,9 +496,9 @@ for (fname, fname_64, elty) in ((:cublasIdamin_v2, :cublasIdamin_v2_64, :Float64
                                 (:cublasIcamin_v2, :cublasIcamin_v2_64, :ComplexF32))
     @eval begin
         function iamin(n::Integer,
-                       dx::StridedCuVecOrDenseMat{$elty},
-                       result::CuRefArray{Ti, CuVector{Ti, M}},
-                      ) where {Ti <: Integer, M <: AbstractMemory}
+                dx::StridedCuVecOrDenseMat{$elty},
+                result::CuRefArray{Ti, CuVector{Ti, M}},
+            ) where {Ti <: Integer, M <: AbstractMemory}
             if CUBLAS.version() >= v"12.0"
                 $fname_64(handle(), n, dx, stride(dx, 1), result)
             else
@@ -529,11 +532,12 @@ for (fname, fname_64, elty) in ((:cublasDgemv_v2, :cublasDgemv_v2_64, :Float64),
                                 (:cublasCgemv_v2, :cublasCgemv_v2_64, :ComplexF32))
     @eval begin
         function gemv!(trans::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                        A::StridedCuMatrix{$elty},
                        x::StridedCuVector{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, M}},
-                       y::StridedCuVector{$elty}) where {M<:AbstractMemory}
+                beta::CuRefArray{$elty, CuVector{$elty, M}},
+                y::StridedCuVector{$elty}
+            ) where {M <: AbstractMemory}
             # handle trans
             m,n = size(A)
             # check dimensions
@@ -558,10 +562,10 @@ function gemv!(trans::Char, alpha::Number, A::StridedCuMatrix{T}, x::StridedCuVe
     synchronize()
     return y
 end
-function gemv(trans::Char, alpha::CuRefArray{T, CuVector{T, M}}, A::StridedCuMatrix{T}, x::StridedCuVector{T}) where {T, M<:AbstractMemory}
+function gemv(trans::Char, alpha::CuRefArray{T, CuVector{T, M}}, A::StridedCuMatrix{T}, x::StridedCuVector{T}) where {T, M <: AbstractMemory}
     return gemv!(trans, alpha, A, x, CuRef{T}(zero(T)), similar(x, size(A, (trans == 'N' ? 1 : 2))))
 end
-function gemv(trans::Char, alpha::Number, A::StridedCuMatrix{T}, x::StridedCuVector{T}) where T
+function gemv(trans::Char, alpha::Number, A::StridedCuMatrix{T}, x::StridedCuVector{T}) where {T}
     gemv!(trans, alpha, A, x, zero(T), similar(x, size(A, (trans == 'N' ? 1 : 2))))
 end
 # should this be async?
@@ -579,12 +583,12 @@ for (fname, fname_64, eltyin, eltyout, eltyconst) in (
     )
     @eval begin
         function gemv_batched!(trans::Char,
-                               alpha::CuRefArray{$eltyconst, CuVector{$eltyconst, M}},
-                               A::Vector{<:StridedCuMatrix{$eltyin}},
-                               x::Vector{<:StridedCuVector{$eltyin}},
-                               beta::CuRefArray{$eltyconst, CuVector{$eltyconst, M}},
-                               y::Vector{<:StridedCuVector{$eltyout}}
-                              ) where {M<:AbstractMemory}
+                alpha::CuRefArray{$eltyconst, CuVector{$eltyconst, M}},
+                A::Vector{<:StridedCuMatrix{$eltyin}},
+                x::Vector{<:StridedCuVector{$eltyin}},
+                beta::CuRefArray{$eltyconst, CuVector{$eltyconst, M}},
+                y::Vector{<:StridedCuVector{$eltyout}}
+            ) where {M <: AbstractMemory}
             if length(A) != length(x) || length(A) != length(y)
                 throw(DimensionMismatch("Lengths of inputs must be the same"))
             end
@@ -615,13 +619,13 @@ for (fname, fname_64, eltyin, eltyout, eltyconst) in (
             y
         end
         function gemv_batched!(
-                               trans::Char,
-                               alpha::Number,
-                               A::Vector{<:StridedCuMatrix{$eltyin}},
-                               x::Vector{<:StridedCuVector{$eltyin}},
-                               beta::Number,
-                               y::Vector{<:StridedCuVector{$eltyout}}
-                              )
+                trans::Char,
+                alpha::Number,
+                A::Vector{<:StridedCuMatrix{$eltyin}},
+                x::Vector{<:StridedCuVector{$eltyin}},
+                beta::Number,
+                y::Vector{<:StridedCuVector{$eltyout}}
+            )
             gpu_α = CuRef{$eltyconst}(alpha)
             gpu_β = CuRef{$eltyconst}(beta)
             y = gemv_batched!(trans, gpu_α, A, x, gpu_β, y)
@@ -641,12 +645,12 @@ for (fname, fname_64, eltyin, eltyout, eltyconst) in (
     )
     @eval begin
         function gemv_strided_batched!(trans::Char,
-                                       alpha::CuRefArray{$eltyconst, CuVector{$eltyconst, M}},
-                                       A::AbstractArray{$eltyin, 3},
-                                       x::AbstractArray{$eltyin, 2},
-                                       beta::CuRefArray{$eltyconst, CuVector{$eltyconst, M}},
-                                       y::AbstractArray{$eltyout, 2}
-                                      ) where {M<:AbstractMemory}
+                alpha::CuRefArray{$eltyconst, CuVector{$eltyconst, M}},
+                A::AbstractArray{$eltyin, 3},
+                x::AbstractArray{$eltyin, 2},
+                beta::CuRefArray{$eltyconst, CuVector{$eltyconst, M}},
+                y::AbstractArray{$eltyout, 2}
+            ) where {M <: AbstractMemory}
             if size(A, 3) != size(x, 2) || size(A, 3) != size(y, 2)
                 throw(DimensionMismatch("Batch sizes must be equal for all inputs"))
             end
@@ -671,13 +675,13 @@ for (fname, fname_64, eltyin, eltyout, eltyconst) in (
             y
         end
         function gemv_strided_batched!(
-                                       trans::Char,
-                                       alpha::Number,
-                                       A::AbstractArray{$eltyin, 3},
-                                       x::AbstractArray{$eltyin, 2},
-                                       beta::Number,
-                                       y::AbstractArray{$eltyout, 2}
-                                      )
+                trans::Char,
+                alpha::Number,
+                A::AbstractArray{$eltyin, 3},
+                x::AbstractArray{$eltyin, 2},
+                beta::Number,
+                y::AbstractArray{$eltyout, 2}
+            )
             gpu_α = CuRef{$eltyconst}(alpha)
             gpu_β = CuRef{$eltyconst}(beta)
             y = gemv_strided_batched!(trans, gpu_α, A, x, gpu_β, y)
@@ -697,11 +701,12 @@ for (fname, fname_64, elty) in ((:cublasDgbmv_v2, :cublasDgbmv_v2_64, :Float64),
                        m::Integer,
                        kl::Integer,
                        ku::Integer,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                        A::StridedCuMatrix{$elty},
                        x::StridedCuVector{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, M}},
-                       y::StridedCuVector{$elty}) where {M<:AbstractMemory}
+                beta::CuRefArray{$elty, CuVector{$elty, M}},
+                y::StridedCuVector{$elty}
+            ) where {M <: AbstractMemory}
             n = size(A,2)
             # check dimensions
             length(x) == (trans == 'N' ? n : m) && length(y) == (trans == 'N' ? m : n) || throw(DimensionMismatch(""))
@@ -716,16 +721,17 @@ for (fname, fname_64, elty) in ((:cublasDgbmv_v2, :cublasDgbmv_v2_64, :Float64),
             end
             y
         end
-        function gbmv!(trans::Char,
-                       m::Integer,
-                       kl::Integer,
-                       ku::Integer,
-                       alpha::Number,
-                       A::StridedCuMatrix{$elty},
-                       x::StridedCuVector{$elty},
-                       beta::Number,
-                       y::StridedCuVector{$elty}
-                      )
+        function gbmv!(
+                trans::Char,
+                m::Integer,
+                kl::Integer,
+                ku::Integer,
+                alpha::Number,
+                A::StridedCuMatrix{$elty},
+                x::StridedCuVector{$elty},
+                beta::Number,
+                y::StridedCuVector{$elty}
+            )
 
             gpu_α = CuRef{$elty}(alpha)
             gpu_β = CuRef{$elty}(beta)
@@ -735,8 +741,10 @@ for (fname, fname_64, elty) in ((:cublasDgbmv_v2, :cublasDgbmv_v2_64, :Float64),
         end
     end
 end
-function gbmv(trans::Char, m::Integer, kl::Integer, ku::Integer, alpha::CuRefArray{T, CuVector{T, M}},
-              A::StridedCuMatrix{T}, x::StridedCuVector{T}) where {T, M<:AbstractMemory}
+function gbmv(
+        trans::Char, m::Integer, kl::Integer, ku::Integer, alpha::CuRefArray{T, CuVector{T, M}},
+        A::StridedCuMatrix{T}, x::StridedCuVector{T}
+    ) where {T, M <: AbstractMemory}
     # TODO: fix gbmv bug in julia
     n = size(A, 2)
     leny = trans == 'N' ? m : n
@@ -759,11 +767,12 @@ for (fname, fname_64, elty) in ((:cublasDspmv_v2, :cublasDspmv_v2_64, :Float64),
                                 (:cublasSspmv_v2, :cublasSspmv_v2_64, :Float32))
     @eval begin
         function spmv!(uplo::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                        AP::StridedCuVector{$elty},
                        x::StridedCuVector{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, M}},
-                       y::StridedCuVector{$elty}) where {M<:AbstractMemory}
+                beta::CuRefArray{$elty, CuVector{$elty, M}},
+                y::StridedCuVector{$elty}
+            ) where {M <: AbstractMemory}
             n = round(Int, (sqrt(8*length(AP))-1)/2)
             if n != length(x) || n != length(y) throw(DimensionMismatch("")) end
             incx = stride(x,1)
@@ -777,21 +786,24 @@ for (fname, fname_64, elty) in ((:cublasDspmv_v2, :cublasDspmv_v2_64, :Float64),
         end
     end
 end
-function spmv!(uplo::Char,
-               alpha::Number,
-               AP::StridedCuVector{T},
-               x::StridedCuVector{T},
-               beta::Number,
-               y::StridedCuVector{T}
-              ) where {T}
+function spmv!(
+        uplo::Char,
+        alpha::Number,
+        AP::StridedCuVector{T},
+        x::StridedCuVector{T},
+        beta::Number,
+        y::StridedCuVector{T}
+    ) where {T}
     gpu_α = CuRef{T}(alpha)
     gpu_β = CuRef{T}(beta)
     y = spmv!(uplo, gpu_α, AP, x, gpu_β, y)
     synchronize()
     return y
 end
-function spmv(uplo::Char, alpha::CuRefArray{T, CuVector{T, M}},
-              AP::StridedCuVector{T}, x::StridedCuVector{T}) where {T, M<:AbstractMemory}
+function spmv(
+        uplo::Char, alpha::CuRefArray{T, CuVector{T, M}},
+        AP::StridedCuVector{T}, x::StridedCuVector{T}
+    ) where {T, M <: AbstractMemory}
     return spmv!(uplo, alpha, AP, x, CuRef{T}(zero(T)), similar(x))
 end
 function spmv(uplo::Char, alpha::Number,
@@ -810,11 +822,12 @@ for (fname, fname_64, elty) in ((:cublasDsymv_v2, :cublasDsymv_v2_64, :Float64),
     # Note that the complex symv are not BLAS but auiliary functions in LAPACK
     @eval begin
         function symv!(uplo::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                        A::StridedCuMatrix{$elty},
                        x::StridedCuVector{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, M}},
-                       y::StridedCuVector{$elty}) where {M<:AbstractMemory}
+                beta::CuRefArray{$elty, CuVector{$elty, M}},
+                y::StridedCuVector{$elty}
+            ) where {M <: AbstractMemory}
             m, n = size(A)
             if m != n throw(DimensionMismatch("Matrix A is $m by $n but must be square")) end
             if m != length(x) || m != length(y) throw(DimensionMismatch("")) end
@@ -847,7 +860,7 @@ end
 function symv(
         uplo::Char, alpha::CuRefArray{T, CuVector{T, M}},
         A::StridedCuMatrix{T}, x::StridedCuVector{T}
-    ) where {T, M<:AbstractMemory}
+    ) where {T, M <: AbstractMemory}
     return symv!(uplo, alpha, A, x, CuRef{T}(zero(T)), similar(x))
 end
 function symv(uplo::Char, alpha::Number,
@@ -864,11 +877,12 @@ for (fname, fname_64, elty) in ((:cublasZhemv_v2, :cublasZhemv_v2_64, :ComplexF6
                                 (:cublasChemv_v2, :cublasChemv_v2_64, :ComplexF32))
     @eval begin
         function hemv!(uplo::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                        A::StridedCuMatrix{$elty},
                        x::StridedCuVector{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, M}},
-                       y::StridedCuVector{$elty}) where {M<:AbstractMemory}
+                beta::CuRefArray{$elty, CuVector{$elty, M}},
+                y::StridedCuVector{$elty}
+            ) where {M <: AbstractMemory}
             # TODO: fix dimension check bug in julia
             m, n = size(A)
             if m != n throw(DimensionMismatch("Matrix A is $m by $n but must be square")) end
@@ -902,7 +916,7 @@ end
 function hemv(
         uplo::Char, alpha::CuRefArray{T, CuVector{T, M}},
         A::StridedCuMatrix{T}, x::StridedCuVector{T}
-    ) where {T, M<:AbstractMemory}
+    ) where {T, M <: AbstractMemory}
     return hemv!(uplo, alpha, A, x, CuRef{T}(zero(T)), similar(x))
 end
 function hemv(uplo::Char, alpha::Number, A::StridedCuMatrix{T},
@@ -922,11 +936,12 @@ for (fname, fname_64, elty) in ((:cublasDsbmv_v2, :cublasDsbmv_v2_64, :Float64),
     @eval begin
         function sbmv!(uplo::Char,
                        k::Integer,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                        A::StridedCuMatrix{$elty},
                        x::StridedCuVector{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, M}},
-                       y::StridedCuVector{$elty}) where {M<:AbstractMemory}
+                beta::CuRefArray{$elty, CuVector{$elty, M}},
+                y::StridedCuVector{$elty}
+            ) where {M <: AbstractMemory}
             m, n = size(A)
             #if m != n throw(DimensionMismatch("Matrix A is $m by $n but must be square")) end
             if !(1<=(1+k)<=n) throw(DimensionMismatch("Incorrect number of bands")) end
@@ -962,7 +977,7 @@ end
 function sbmv(
         uplo::Char, k::Integer, alpha::CuRefArray{T, CuVector{T, M}},
         A::StridedCuMatrix{T}, x::StridedCuVector{T}
-    ) where {T, M<:AbstractMemory}
+    ) where {T, M <: AbstractMemory}
     return sbmv!(uplo, k, alpha, A, x, CuRef{T}(zero(T)), similar(x))
 end
 function sbmv(uplo::Char, k::Integer, alpha::Number,
@@ -981,11 +996,12 @@ for (fname, fname_64, elty) in ((:cublasZhbmv_v2, :cublasZhbmv_v2_64, :ComplexF6
     @eval begin
         function hbmv!(uplo::Char,
                        k::Integer,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                        A::StridedCuMatrix{$elty},
                        x::StridedCuVector{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, M}},
-                       y::StridedCuVector{$elty}) where {M<:AbstractMemory}
+                beta::CuRefArray{$elty, CuVector{$elty, M}},
+                y::StridedCuVector{$elty}
+            ) where {M <: AbstractMemory}
             m, n = size(A)
             if !(1<=(1+k)<=n) throw(DimensionMismatch("Incorrect number of bands")) end
             if m < 1+k throw(DimensionMismatch("Array A has fewer than 1+k rows")) end
@@ -1020,7 +1036,7 @@ end
 function hbmv(
         uplo::Char, k::Integer, alpha::CuRefArray{T, CuVector{T, M}},
         A::StridedCuMatrix{T}, x::StridedCuVector{T}
-    ) where {T, M<:AbstractMemory}
+    ) where {T, M <: AbstractMemory}
     return hbmv!(uplo, k, alpha, A, x, CuRef{T}(zero(T)), similar(x))
 end
 function hbmv(uplo::Char, k::Integer, alpha::Number,
@@ -1168,10 +1184,11 @@ for (fname, fname_64, elty) in ((:cublasDger_v2, :cublasDger_v2_64, :Float64),
                                 (:cublasCgerc_v2, :cublasCgerc_v2_64, :ComplexF32))
     @eval begin
         function ger!(
-                      alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                       x::StridedCuVector{$elty},
                       y::StridedCuVector{$elty},
-                      A::StridedCuMatrix{$elty}) where {M<:AbstractMemory}
+                A::StridedCuMatrix{$elty}
+            ) where {M <: AbstractMemory}
             m, n = size(A)
             m == length(x) || throw(DimensionMismatch(""))
             n == length(y) || throw(DimensionMismatch(""))
@@ -1204,9 +1221,10 @@ for (fname, fname_64, elty) in ((:cublasDspr_v2, :cublasDspr_v2_64, :Float64),
                                 (:cublasSspr_v2, :cublasSspr_v2_64, :Float32))
     @eval begin
         function spr!(uplo::Char,
-                      alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                       x::StridedCuVector{$elty},
-                      AP::StridedCuVector{$elty}) where {M<:AbstractMemory}
+                AP::StridedCuVector{$elty}
+            ) where {M <: AbstractMemory}
             n = round(Int, (sqrt(8*length(AP))-1)/2)
             length(x) == n || throw(DimensionMismatch("Length of vector must be the same as the matrix dimensions"))
             incx = stride(x,1)
@@ -1238,9 +1256,10 @@ for (fname, fname_64, elty) in ((:cublasDsyr_v2, :cublasDsyr_v2_64, :Float64),
                                 (:cublasCsyr_v2, :cublasCsyr_v2_64, :ComplexF32))
     @eval begin
         function syr!(uplo::Char,
-                      alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                       x::StridedCuVector{$elty},
-                      A::StridedCuMatrix{$elty}) where {M<:AbstractMemory}
+                A::StridedCuMatrix{$elty}
+            ) where {M <: AbstractMemory}
             m, n = size(A)
             m == n || throw(DimensionMismatch("Matrix A is $m by $n but must be square"))
             length(x) == n || throw(DimensionMismatch("Length of vector must be the same as the matrix dimensions"))
@@ -1274,9 +1293,10 @@ for (fname, fname_64, elty, relty) in (
     )
     @eval begin
         function her!(uplo::Char,
-                      alpha::CuRefArray{$relty, CuVector{$relty, M}},
+                alpha::CuRefArray{$relty, CuVector{$relty, M}},
                       x::StridedCuVector{$elty},
-                      A::StridedCuMatrix{$elty}) where {M<:AbstractMemory}
+                A::StridedCuMatrix{$elty}
+            ) where {M <: AbstractMemory}
             m, n = size(A)
             m == n || throw(DimensionMismatch("Matrix A is $m by $n but must be square"))
             length(x) == n || throw(DimensionMismatch("Length of vector must be the same as the matrix dimensions"))
@@ -1308,11 +1328,11 @@ for (fname, fname_64, elty) in ((:cublasZher2_v2, :cublasZher2_v2_64, :ComplexF6
                                 (:cublasCher2_v2, :cublasCher2_v2_64, :ComplexF32))
     @eval begin
         function her2!(uplo::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
-                       x::StridedCuVector{$elty},
-                       y::StridedCuVector{$elty},
-                       A::StridedCuMatrix{$elty}
-                      ) where {M<:AbstractMemory}
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                x::StridedCuVector{$elty},
+                y::StridedCuVector{$elty},
+                A::StridedCuMatrix{$elty}
+            ) where {M <: AbstractMemory}
             m, n = size(A)
             m == n || throw(DimensionMismatch("Matrix A is $m by $n but must be square"))
             length(x) == n || throw(DimensionMismatch("Length of vector must be the same as the matrix dimensions"))
@@ -1352,11 +1372,12 @@ for (fname, fname_64, elty) in ((:cublasDgemm_v2, :cublasDgemm_v2_64, :Float64),
     @eval begin
         function gemm!(transA::Char,
                        transB::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                        A::StridedCuVecOrMat{$elty},
                        B::StridedCuVecOrMat{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, M}},
-                       C::StridedCuVecOrMat{$elty}) where {M<:AbstractMemory}
+                beta::CuRefArray{$elty, CuVector{$elty, M}},
+                C::StridedCuVecOrMat{$elty}
+            ) where {M <: AbstractMemory}
             m = size(A, transA == 'N' ? 1 : 2)
             k = size(A, transA == 'N' ? 2 : 1)
             n = size(B, transB == 'N' ? 2 : 1)
@@ -1393,7 +1414,7 @@ end
 function gemm(
         transA::Char, transB::Char, alpha::CuRefArray{T, CuVector{T, M}},
         A::StridedCuVecOrMat{T}, B::StridedCuVecOrMat{T}
-    ) where {T, M<:AbstractMemory}
+    ) where {T, M <: AbstractMemory}
     return gemm!(
         transA, transB, alpha, A, B, CuRef(zero(T)),
         similar(
@@ -1493,10 +1514,10 @@ function gemmExComputeType(TA, TB, TC, m, k, n)
 end
 
 function gemmEx!(transA::Char, transB::Char,
-                 @nospecialize(alpha::CuRefArray),
+        @nospecialize(alpha::CuRefArray),
                  @nospecialize(A::StridedCuVecOrMat),
                  @nospecialize(B::StridedCuVecOrMat),
-                 @nospecialize(beta::CuRefArray),
+        @nospecialize(beta::CuRefArray),
                  @nospecialize(C::StridedCuVecOrMat);
                  algo::cublasGemmAlgo_t=CUBLAS_GEMM_DEFAULT)
     m = size(A, transA == 'N' ? 1 : 2)
@@ -1551,10 +1572,10 @@ end
 
 # TODO for device mode pointers
 function gemmBatchedEx!(transA::Char, transB::Char,
-                 @nospecialize(alpha::CuRefArray),
+        @nospecialize(alpha::CuRefArray),
                  @nospecialize(A::Vector{<:StridedCuVecOrMat}),
                  @nospecialize(B::Vector{<:StridedCuVecOrMat}),
-                 @nospecialize(beta::CuRefArray),
+        @nospecialize(beta::CuRefArray),
                  @nospecialize(C::Vector{<:StridedCuVecOrMat});
                  algo::cublasGemmAlgo_t=CUBLAS_GEMM_DEFAULT)
     if length(A) != length(B) || length(A) != length(C)
@@ -1622,11 +1643,11 @@ function gemmBatchedEx!(
 end
 
 function gemmStridedBatchedEx!(
-                 transA::Char, transB::Char,
-                 @nospecialize(alpha::CuRefArray),
+        transA::Char, transB::Char,
+        @nospecialize(alpha::CuRefArray),
                  @nospecialize(A::AbstractArray{Ta, 3}),
                  @nospecialize(B::AbstractArray{Tb, 3}),
-                 @nospecialize(beta::CuRefArray),
+        @nospecialize(beta::CuRefArray),
                  @nospecialize(C::AbstractArray{Tc, 3});
                  algo::cublasGemmAlgo_t=CUBLAS_GEMM_DEFAULT) where {Ta, Tb, Tc}
     if size(A, 3) != size(B, 3) || size(A, 3) != size(C, 3)
@@ -1865,11 +1886,12 @@ for (fname, fname_64, elty) in ((:cublasDgemmBatched, :cublasDgemmBatched_64, :F
     @eval begin
         function gemm_batched!(transA::Char,
                                transB::Char,
-                               alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                                A::Vector{<:StridedCuMatrix{$elty}},
                                B::Vector{<:StridedCuMatrix{$elty}},
-                               beta::CuRefArray{$elty, CuVector{$elty, M}},
-                               C::Vector{<:StridedCuMatrix{$elty}}) where {M<:AbstractMemory}
+                beta::CuRefArray{$elty, CuVector{$elty, M}},
+                C::Vector{<:StridedCuMatrix{$elty}}
+            ) where {M <: AbstractMemory}
             if length(A) != length(B) || length(A) != length(C)
                 throw(DimensionMismatch(""))
             end
@@ -1948,11 +1970,12 @@ for (fname, fname_64, elty) in ((:cublasDgemmStridedBatched, :cublasDgemmStrided
     @eval begin
         function gemm_strided_batched!(transA::Char,
                                transB::Char,
-                               alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                                A::AbstractArray{$elty, 3}, # allow PermutedDimsArray
                                B::AbstractArray{$elty, 3},
-                               beta::CuRefArray{$elty, CuVector{$elty, M}},
-                               C::AbstractArray{$elty, 3}) where {M<:AbstractMemory}
+                beta::CuRefArray{$elty, CuVector{$elty, M}},
+                C::AbstractArray{$elty, 3}
+            ) where {M <: AbstractMemory}
            m = size(A, transA == 'N' ? 1 : 2)
            k = size(A, transA == 'N' ? 2 : 1)
            n = size(B, transB == 'N' ? 2 : 1)
@@ -2031,11 +2054,12 @@ for (fname, fname_64, elty) in ((:cublasDsymm_v2, :cublasDsymm_v2_64, :Float64),
     @eval begin
         function symm!(side::Char,
                        uplo::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                        A::StridedCuMatrix{$elty},
                        B::StridedCuMatrix{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, M}},
-                       C::StridedCuMatrix{$elty}) where {M<:AbstractMemory}
+                beta::CuRefArray{$elty, CuVector{$elty, M}},
+                C::StridedCuMatrix{$elty}
+            ) where {M <: AbstractMemory}
             k, nA = size(A)
             if k != nA throw(DimensionMismatch("Matrix A must be square")) end
             m = side == 'L' ? k : size(B,1)
@@ -2073,7 +2097,7 @@ end
 function symm(
         side::Char, uplo::Char, alpha::CuRefArray{T, CuVector{T, M}},
         A::StridedCuMatrix{T}, B::StridedCuMatrix{T}
-    ) where {T, M<:AbstractMemory}
+    ) where {T, M <: AbstractMemory}
     return symm!(side, uplo, alpha, A, B, CuRef{T}(zero(T)), similar(B))
 end
 function symm(side::Char, uplo::Char, alpha::Number,
@@ -2093,10 +2117,11 @@ for (fname, fname_64, elty) in ((:cublasDsyrk_v2, :cublasDsyrk_v2_64, :Float64),
     @eval begin
         function syrk!(uplo::Char,
                        trans::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                        A::StridedCuVecOrMat{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, M}},
-                       C::StridedCuMatrix{$elty}) where {M<:AbstractMemory}
+                beta::CuRefArray{$elty, CuVector{$elty, M}},
+                C::StridedCuMatrix{$elty}
+            ) where {M <: AbstractMemory}
             mC, n = size(C)
             if mC != n throw(DimensionMismatch("C must be square")) end
             nn = size(A, trans == 'N' ? 1 : 2)
@@ -2146,11 +2171,12 @@ for (fname, fname_64, elty) in ((:cublasDsyrkx, :cublasDsyrkx_64, :Float64),
     @eval begin
         function syrkx!(uplo::Char,
                        trans::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                        A::StridedCuVecOrMat{$elty},
                        B::StridedCuVecOrMat{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, M}},
-                       C::StridedCuMatrix{$elty}) where {M<:AbstractMemory}
+                beta::CuRefArray{$elty, CuVector{$elty, M}},
+                C::StridedCuMatrix{$elty}
+            ) where {M <: AbstractMemory}
             mC, n = size(C)
             if mC != n throw(DimensionMismatch("C must be square")) end
             nn = size(A, trans == 'N' ? 1 : 2)
@@ -2186,7 +2212,7 @@ end
 function syrkx(
         uplo::Char, trans::Char, alpha::CuRefArray{T, CuVector{T, M}}, A::StridedCuVecOrMat{T},
         beta::CuRefArray{T, CuVector{T}}, B::StridedCuVecOrMat{T}
-    ) where {T, M<:AbstractMemory}
+    ) where {T, M <: AbstractMemory}
     n = size(A, trans == 'N' ? 1 : 2)
     return syrkx!(uplo, trans, alpha, A, B, beta, similar(A, (n, n)))
 end
@@ -2205,11 +2231,12 @@ for (fname, fname_64, elty) in ((:cublasZhemm_v2, :cublasZhemm_v2_64, :ComplexF6
     @eval begin
         function hemm!(side::Char,
                        uplo::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                        A::StridedCuMatrix{$elty},
                        B::StridedCuMatrix{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, M}},
-                       C::StridedCuMatrix{$elty}) where {M<:AbstractMemory}
+                beta::CuRefArray{$elty, CuVector{$elty, M}},
+                C::StridedCuMatrix{$elty}
+            ) where {M <: AbstractMemory}
             mA, nA = size(A)
             m, n = size(B)
             mC, nC = size(C)
@@ -2247,7 +2274,7 @@ end
 function hemm(
         uplo::Char, trans::Char, alpha::CuRefArray{T, CuVector{T, M}},
         A::StridedCuMatrix{T}, B::StridedCuMatrix{T}
-    ) where {T, M<:AbstractMemory}
+    ) where {T, M <: AbstractMemory}
     m, n = size(B)
     return hemm!(uplo, trans, alpha, A, B, CuRef{T}(zero(T)), similar(B, (m, n)))
 end
@@ -2268,10 +2295,11 @@ for (fname, fname_64, elty, relty) in (
     @eval begin
         function herk!(uplo::Char,
                        trans::Char,
-                       alpha::CuRefArray{$relty, CuVector{$relty, M}},
+                alpha::CuRefArray{$relty, CuVector{$relty, M}},
                        A::StridedCuVecOrMat{$elty},
-                       beta::CuRefArray{$relty, CuVector{$relty, M}},
-                       C::StridedCuMatrix{$elty}) where {M<:AbstractMemory}
+                beta::CuRefArray{$relty, CuVector{$relty, M}},
+                C::StridedCuMatrix{$elty}
+            ) where {M <: AbstractMemory}
             mC, n = size(C)
             if mC != n throw(DimensionMismatch("C must be square")) end
             nn = size(A, trans == 'N' ? 1 : 2)
@@ -2305,7 +2333,7 @@ for (fname, fname_64, elty, relty) in (
                 trans::Char,
                 alpha::CuRefArray{$relty, CuVector{$relty, M}},
                 A::StridedCuVecOrMat{$elty}
-            ) where {M<:AbstractMemory}
+            ) where {M <: AbstractMemory}
             n = size(A, trans == 'N' ? 1 : 2)
             return herk!(uplo, trans, alpha, A, CuRef{$relty}(zero($relty)), similar(A, (n, n)))
         end
@@ -2327,11 +2355,12 @@ for (fname, fname_64, elty) in ((:cublasDsyr2k_v2, :cublasDsyr2k_v2_64, :Float64
     @eval begin
         function syr2k!(uplo::Char,
                         trans::Char,
-                        alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                         A::StridedCuVecOrMat{$elty},
                         B::StridedCuVecOrMat{$elty},
-                        beta::CuRefArray{$elty, CuVector{$elty, M}},
-                        C::StridedCuMatrix{$elty}) where {M<:AbstractMemory}
+                beta::CuRefArray{$elty, CuVector{$elty, M}},
+                C::StridedCuMatrix{$elty}
+            ) where {M <: AbstractMemory}
             # TODO: check size of B in julia (syr2k!)
             m, n = size(C)
             if m != n throw(DimensionMismatch("C must be square")) end
@@ -2386,7 +2415,7 @@ function syr2k(uplo::Char,
                B::StridedCuVecOrMat)
     T = eltype(A)
     n = size(A, trans == 'N' ? 1 : 2)
-    syr2k!(uplo, trans, convert(T, alpha), A, B, zero(T), similar(A, T, (n, n)))
+    return syr2k!(uplo, trans, convert(T, alpha), A, B, zero(T), similar(A, T, (n, n)))
 end
 function syr2k(uplo::Char, trans::Char, A::StridedCuVecOrMat, B::StridedCuVecOrMat)
     syr2k(uplo, trans, one(eltype(A)), A, B)
@@ -2400,11 +2429,12 @@ for (fname, fname_64, elty, relty) in (
     @eval begin
         function her2k!(uplo::Char,
                         trans::Char,
-                        alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                         A::StridedCuVecOrMat{$elty},
                         B::StridedCuVecOrMat{$elty},
-                        beta::CuRefArray{$relty, CuVector{$relty, M}},
-                        C::StridedCuMatrix{$elty}) where {M<:AbstractMemory}
+                beta::CuRefArray{$relty, CuVector{$relty, M}},
+                C::StridedCuMatrix{$elty}
+            ) where {M <: AbstractMemory}
             # TODO: check size of B in julia (her2k!)
             m, n = size(C)
             if m != n throw(DimensionMismatch("C must be square")) end
@@ -2447,7 +2477,7 @@ for (fname, fname_64, elty, relty) in (
                 alpha::CuRefArray{$elty, CuVector{$elty, M}},
                 A::StridedCuVecOrMat{$elty},
                 B::StridedCuVecOrMat{$elty},
-            ) where {M<:AbstractMemory}
+            ) where {M <: AbstractMemory}
             n = size(A, trans == 'N' ? 1 : 2)
             return her2k!(uplo, trans, alpha, A, B, CuRef{$relty}(zero($relty)), similar(A, (n, n)))
         end
@@ -2477,10 +2507,11 @@ for (mmname, smname, elty) in
                        uplo::Char,
                        transa::Char,
                        diag::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                        A::StridedCuMatrix{$elty},
                        B::StridedCuMatrix{$elty},
-                       C::StridedCuMatrix{$elty}) where {M<:AbstractMemory}
+                C::StridedCuMatrix{$elty}
+            ) where {M <: AbstractMemory}
             m, n = size(B)
             mA, nA = size(A)
             # TODO: clean up error messages
@@ -2499,9 +2530,10 @@ for (mmname, smname, elty) in
                        uplo::Char,
                        transa::Char,
                        diag::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                        A::StridedCuMatrix{$elty},
-                       B::StridedCuMatrix{$elty}) where {M<:AbstractMemory}
+                B::StridedCuMatrix{$elty}
+            ) where {M <: AbstractMemory}
             m, n = size(B)
             mA, nA = size(A)
             # TODO: clean up error messages
@@ -2564,9 +2596,10 @@ for (fname, fname_64, elty) in ((:cublasDtrsmBatched, :cublasDtrsmBatched_64, :F
                                uplo::Char,
                                transa::Char,
                                diag::Char,
-                               alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                                A::Vector{<:StridedCuMatrix{$elty}},
-                               B::Vector{<:StridedCuMatrix{$elty}}) where {M<:AbstractMemory}
+                B::Vector{<:StridedCuMatrix{$elty}}
+            ) where {M <: AbstractMemory}
             if length(A) != length(B)
                 throw(DimensionMismatch(""))
             end
@@ -2620,11 +2653,12 @@ for (fname, fname_64, elty) in ((:cublasDgeam, :cublasDgeam_64, :Float64),
     @eval begin
         function geam!(transa::Char,
                        transb::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, M}},
+                alpha::CuRefArray{$elty, CuVector{$elty, M}},
                        A::StridedCuMatrix{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, M}},
+                beta::CuRefArray{$elty, CuVector{$elty, M}},
                        B::StridedCuMatrix{$elty},
-                       C::StridedCuMatrix{$elty}) where {M<:AbstractMemory}
+                C::StridedCuMatrix{$elty}
+            ) where {M <: AbstractMemory}
             mA, nA = size(A)
             mB, nB = size(B)
             m, n = size(C)
@@ -2860,8 +2894,9 @@ for (fname, elty) in ((:cublasDgetriBatched, :Float64),
         end
 
         function getri_batched!(n, Aptrs::CuVector{CuPtr{$elty}},
-                                lda, Cptrs::CuVector{CuPtr{$elty}},ldc,
-                                pivotArray::CuArray{Cint})
+                lda, Cptrs::CuVector{CuPtr{$elty}}, ldc,
+                pivotArray::CuArray{Cint}
+            )
             batchSize = length(Aptrs)
             info = CuArray{Cint}(undef, batchSize)
             $fname(handle(), n, Aptrs, lda, pivotArray, Cptrs, ldc, info, batchSize)

@maleadt (Member) commented Jan 20, 2025

CI failures seem relevant.

Feel free to ignore the formatter; I made it less spammy 😉

@kshyatt (Contributor, Author) commented Jan 23, 2025

I really don't know what's up with the 1.11 failure; it looks alloc_cache-related?

@maleadt (Member) commented Jan 25, 2025

Rebase to get rid of CI failures?

@kshyatt (Contributor, Author) commented Jan 25, 2025 via email
